$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\piv}{\mathbf{\pi}} \newcommand{\yv}{\mathbf{y}} \newcommand{\Yv}{\mathbf{Y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\gv}{\mathbf{g}} \newcommand{\Hv}{\mathbf{H}} \newcommand{\dv}{\mathbf{d}} \newcommand{\Vv}{\mathbf{V}} \newcommand{\vv}{\mathbf{v}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\Sv}{\mathbf{S}} \newcommand{\zv}{\mathbf{z}} \newcommand{\Zv}{\mathbf{Z}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}} $
Hierarchical clustering is often used to construct dendrograms. We will see an example below.
The methods are straightforward. The similarity between pairs of samples is usually related to the Euclidean distance between them. In agglomerative clustering, initially each sample is in a unique cluster. Then, the most similar two clusters are merged. This continues until a single cluster results that contains all samples. The distance between two clusters, $C_i$ and $C_j$, can be determined by the single-link method $$ d(C_i,C_j) = \min_{\xv\in C_i, \yv\in C_j} d(\xv,\yv) $$ or complete-link method $$ d(C_i,C_j) = \max_{\xv\in C_i, \yv\in C_j} d(\xv,\yv) $$ where $d(\xv,\yv)$ is the Euclidean distance between $\xv$ and $\yv$.
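For concreteness, here is a small sketch of the two linkage rules (using made-up points, not the data analyzed below), computing all pairwise Euclidean distances with NumPy broadcasting and then taking the min or the max:
In [ ]:
import numpy as np

# Two small clusters of 2-D samples (hypothetical values, for illustration only)
Ci = np.array([[1.0, 2.0], [2.0, 3.0]])
Cj = np.array([[2.0, 2.0], [6.0, 7.0]])

# All pairwise Euclidean distances between samples in Ci and samples in Cj
pairwise = np.sqrt(np.sum((Ci[:, np.newaxis, :] - Cj)**2, axis=2))

pairwise.min(), pairwise.max()   # single-link distance, complete-link distance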
In divisive clustering, all samples are initially in one cluster, which is successively split until all samples are in unique clusters. We will use agglomerative clustering, as it often results in more compact dendrograms.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Let's represent clusters as a list of sample matrices, each matrix containing samples from one cluster. Initially, all samples are in their own clusters. Let's use the Old Faithful data to develop our implementation.
In [2]:
!wget http://www.cs.colostate.edu/~anderson/cs480/notebooks/oldfaithful.csv
In [3]:
data = np.loadtxt('oldfaithful.csv')
In [4]:
data.shape
Out[4]:
In [5]:
plt.scatter(data[:,0],data[:,1]);
plt.xlabel('Duration');
plt.ylabel('Interval');
In [6]:
clusters = [d for d in data]
clusters[:5]
Out[6]:
Now we need the complete-linkage cluster distance function.
In [7]:
C1 = [np.array([[1,2]]), np.array([[2,3]])]
C2 = [np.array([[2,2]]), np.array([[6,7]])]
C1,C2
Out[7]:
In [8]:
allC1 = np.vstack((C1))
allC2 = np.vstack((C2))
allC1,allC2
Out[8]:
In [9]:
allC1[:,np.newaxis,:] - allC2
Out[9]:
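The np.newaxis inserts a length-one axis so that the subtraction broadcasts: allC1[:,np.newaxis,:] has shape (2, 1, 2), allC2 has shape (2, 2), and the difference has shape (2, 2, 2), holding the difference between every sample in allC1 and every sample in allC2. A quick check of the shapes (assuming the cells above have been run):
In [ ]:
allC1[:,np.newaxis,:].shape, (allC1[:,np.newaxis,:] - allC2).shape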
In [10]:
np.sum((allC1[:,np.newaxis,:] - allC2)**2,axis=2)
Out[10]:
In [11]:
np.max(np.sum((allC1[:,np.newaxis,:] - allC2)**2,axis=2))
Out[11]:
So, the maximum squared distance between $C_1$ and $C_2$ is 50.
In [12]:
def clusterDistance(Ci,Cj):
allCi = np.vstack((Ci))
allCj = np.vstack((Cj))
return np.max(np.sum((allCi[:,np.newaxis,:] - allCj)**2, axis=2))
In [13]:
clusterDistance(C1,C2)
Out[13]:
All that is left is a way to identify the two clusters with the minimum distance.
In [14]:
C3 = [np.array([[6,4]]), np.array([[8,9]])]
In [15]:
clusters = [C1, C2, C3]
clusters
Out[15]:
In [16]:
for i in range(len(clusters)-1):
for j in range(i+1,len(clusters)):
print(i,j)
In [17]:
dists = []
for i in range(len(clusters)-1):
for j in range(i+1,len(clusters)):
dists.append([i,j,clusterDistance(clusters[i],clusters[j])])
dists
Out[17]:
or
In [18]:
[[i,j,clusterDistance(clusters[i],clusters[j])] for i in range(len(clusters)-1) for j in range(i+1,len(clusters))]
Out[18]:
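To pick the pair of clusters to merge, find the row of dists with the smallest distance (reusing dists from the cells above):
In [ ]:
dists = np.array(dists)
whichClosest = np.argmin(dists[:, -1])   # row with the smallest cluster distance
dists[whichClosest, :2]                  # indices of the two closest clusters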
From here on, each cluster is represented as a dict with an 'X' key holding the cluster's samples and a 'label' key. The cluster distance below uses the mean of the pairwise squared distances; the commented-out lines are the single-link (min) and complete-link (max) alternatives.
In [71]:
def clusterDistance(Ci,Cj):
'''Ci and Cj are two clusters, each being a dict with 'X' and 'label' keys'''
return np.mean(np.sum((Ci['X'][:,np.newaxis,:] - Cj['X'])**2, axis=2))
# return np.min(np.sum((Ci['X'][:,np.newaxis,:] - Cj['X'])**2, axis=2))
# return np.max(np.sum((Ci['X'][:,np.newaxis,:] - Cj['X'])**2, axis=2))
So, clusters at indices 0 and 1 are closest. We can merge these two using np.vstack. Now we are ready to write the function.
In [72]:
def mergeClusters(Ci,Cj, k):
return {'X': np.vstack((Ci['X'], Cj['X'])),
'label': k}
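For example, merging two small dict-style clusters (made-up values) stacks their samples and assigns the new label:
In [ ]:
Ca = {'X': np.array([[1, 2], [2, 3]]), 'label': 0}
Cb = {'X': np.array([[2, 2]]), 'label': 1}
mergeClusters(Ca, Cb, 4)    # {'X': array([[1, 2], [2, 3], [2, 2]]), 'label': 4}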
Now for a simple, but very inefficient, implementation of agglomerative clustering.
In [73]:
def agglomerative(X, clusterDistanceF, nClusters):
    # clusters is a list of dicts, each holding a cluster's samples ('X') and its label
    clusters = [ {'X': X[i:i+1, :], 'label': i} for i in range(X.shape[0]) ]
    k = X.shape[0] - 1
    while len(clusters) > nClusters:
        # Distances between all pairs of current clusters
        dists = np.array( [[i, j, clusterDistanceF(clusters[i], clusters[j])] for i in range(len(clusters)-1) for j in range(i+1, len(clusters))] )
        whichClosest = np.argmin(dists[:, -1])
        closest = dists[whichClosest, :2]
        i, j = closest.astype(int)
        # Merge the two closest clusters into position i, then remove cluster j
        k += 1
        clusters[i] = mergeClusters(clusters[i], clusters[j], k)
        clusters.pop(j)
        print(len(clusters), end=' ')
    return clusters
In [74]:
data.shape
Out[74]:
In [75]:
clusters = agglomerative(data,clusterDistance, 2)
In [76]:
clusters
Out[76]:
In [77]:
for i in range(len(clusters)):
cluster = clusters[i]['X']
plt.scatter(cluster[:,0], cluster[:,1])
plt.xlabel('Duration');
plt.ylabel('Interval');
How might we make this more efficient?
Maybe we can compute the pairwise squared distances between all data points just once. Then each cluster is defined by a list of indices into this distance matrix.
In [79]:
dataDists = np.sum((data[:,np.newaxis,:] - data)**2, axis=2)
dataDists.shape
Out[79]:
In [80]:
def clusterDistance(Ci, Cj, dataDists):
    '''Ci and Cj are two clusters, each a dict with 'X' (a list of sample indices) and 'label' keys'''
    return np.mean( np.array([dataDists[i,j] for i in Ci['X'] for j in Cj['X']]) )
    # return np.min( np.array([dataDists[i,j] for i in Ci['X'] for j in Cj['X']]) )   # single-link
    # return np.max( np.array([dataDists[i,j] for i in Ci['X'] for j in Cj['X']]) )   # complete-link
In [81]:
def agglomerative(X, clusterDistanceF, nClusters):
    # Pairwise squared distances between all samples, computed once
    dataDists = np.sum((X[:, np.newaxis, :] - X)**2, axis=2)
    # clusters is a list of dicts, each holding a list of sample indices ('X') and a label
    clusters = [ {'X': [i], 'label': i} for i in range(X.shape[0]) ]
    k = X.shape[0] - 1
    while len(clusters) > nClusters:
        # Distances between all pairs of current clusters
        dists = np.array( [[i, j, clusterDistanceF(clusters[i], clusters[j], dataDists)] for i in range(len(clusters)-1) for j in range(i+1, len(clusters))] )
        whichClosest = np.argmin(dists[:, -1])
        closest = dists[whichClosest, :2]
        i, j = closest.astype(int)
        # Merge the two closest clusters by concatenating their index lists
        k += 1
        clusters[i] = {'X': clusters[i]['X'] + clusters[j]['X'],
                       'label': k}
        clusters.pop(j)
        print(len(clusters), end=' ')
    return clusters
In [82]:
clusters = agglomerative(data,clusterDistance, 2)
In [85]:
for i in range(len(clusters)):
cluster = clusters[i]['X']
coords = np.array([data[c] for c in cluster])
plt.scatter(coords[:,0], coords[:,1])
plt.xlabel('Duration');
plt.ylabel('Interval');
In [ ]:
What else could you do to speed this up?
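One possibility, sketched below rather than a definitive answer: since dataDists is already computed, the Python loops inside clusterDistance can be replaced by a single fancy-indexing lookup with np.ix_, which pulls out the submatrix of distances between the two clusters in one step. A bigger gain would come from updating only the distances that involve the newly merged cluster instead of recomputing every pair on each iteration.
In [ ]:
def clusterDistance(Ci, Cj, dataDists):
    '''Mean pairwise squared distance, looked up from the precomputed distance matrix.'''
    return np.mean(dataDists[np.ix_(Ci['X'], Cj['X'])])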
In [ ]:
Let's try another data set, this time from Finland.
In [67]:
data2 = np.loadtxt('userslocations.csv')
In [68]:
data2.shape
Out[68]:
In [ ]:
clusters = agglomerative(data2, clusterDistance, 4)
plt.figure(figsize=(20, 8))
for i in range(len(clusters)):
    cluster = clusters[i]['X']
    coords = np.array([data2[c] for c in cluster])
    plt.scatter(coords[:, 0], coords[:, 1])
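Finally, as mentioned at the top, hierarchical clustering is often visualized with a dendrogram. Here is a sketch using scipy's complete-link implementation on the Old Faithful data; scipy builds the full merge tree itself, so its result need not match our implementation exactly:
In [ ]:
from scipy.cluster.hierarchy import linkage, dendrogram

Z = linkage(data, method='complete')   # complete-link agglomerative clustering
plt.figure(figsize=(12, 5))
dendrogram(Z, no_labels=True)
plt.xlabel('Samples')
plt.ylabel('Cluster distance');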